I trained everything on an A100-80G GPU.
1. 3D Gaussian Splatting
1.1.5 Perform Splatting
1.2.2 Perform Forward Pass and Compute Loss
"wriggling gaussians"
Evaluation --- Mean PSNR: 28.489
Evaluation --- Mean SSIM: 0.930
parameters = [
{"params": [gaussians.pre_act_opacities], "lr": 0.05, "name": "opacities"},
{"params": [gaussians.pre_act_scales], "lr": 0.01, "name": "scales"},
{"params": [gaussians.colors], "lr": 0.05, "name": "colors"},
{"params": [gaussians.means], "lr": 0.001, "name": "means"},
]
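For reference, here is a minimal sketch of wiring these parameter groups into an Adam optimizer. The `Gaussians` container below is a toy stand-in whose attribute names simply mirror the snippet above, not the actual class from the codebase:

```python
import torch

# Toy stand-in for the Gaussians container; attribute names mirror the
# parameter-group snippet above (assumed, not the real class definition).
class Gaussians:
    def __init__(self, n):
        self.pre_act_opacities = torch.nn.Parameter(torch.zeros(n))
        self.pre_act_scales = torch.nn.Parameter(torch.zeros(n, 3))
        self.colors = torch.nn.Parameter(torch.rand(n, 3))
        self.means = torch.nn.Parameter(torch.randn(n, 3))

gaussians = Gaussians(1000)
parameters = [
    {"params": [gaussians.pre_act_opacities], "lr": 0.05, "name": "opacities"},
    {"params": [gaussians.pre_act_scales], "lr": 0.01, "name": "scales"},
    {"params": [gaussians.colors], "lr": 0.05, "name": "colors"},
    {"params": [gaussians.means], "lr": 0.001, "name": "means"},
]
# Adam keeps a separate learning rate per parameter group; the extra
# "name" key is preserved and handy for logging.
optimizer = torch.optim.Adam(parameters)
for group in optimizer.param_groups:
    print(group["name"], group["lr"])
```

Keeping the means at a much smaller learning rate than colors/opacities is what keeps the scene from drifting while appearance converges quickly.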
1k iterations took 19 min 29 s, with a final loss of 0.008. The loss already reached that level after roughly 100 iterations, so far fewer iterations should suffice.
1.3.1 Rendering Using Spherical Harmonics
Not view-dependent                    View-dependent
Not view-dependent                    View-dependent
I think it's quite evident: the velvet on the chair has a complicated BRDF, resulting in pseudo-shadows. Besides, the golden imprints appear more specular.
Not view-dependent                    View-dependent
There is also a pseudo-shadow at the edge of the chair.
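To make concrete how view dependence falls out of spherical harmonics, here is a sketch of evaluating degree-1 SH color per Gaussian. The constants are the real band-0/band-1 SH coefficients, but the tensor shapes are assumptions rather than the assignment's exact API:

```python
import torch

# Real SH constants for bands 0 and 1.
C0 = 0.28209479177387814
C1 = 0.4886025119029199

def sh_to_rgb(sh_coeffs, view_dirs):
    """Evaluate degree-1 spherical harmonics along unit view directions.

    sh_coeffs: (N, 4, 3) -- one DC plus three degree-1 coefficients per channel.
    view_dirs: (N, 3)    -- unit vectors from the camera toward each Gaussian.
    """
    x, y, z = view_dirs[:, 0:1], view_dirs[:, 1:2], view_dirs[:, 2:3]
    rgb = (C0 * sh_coeffs[:, 0]
           - C1 * y * sh_coeffs[:, 1]
           + C1 * z * sh_coeffs[:, 2]
           - C1 * x * sh_coeffs[:, 3])
    # Shift so a zero DC coefficient maps to mid-gray, then clamp to [0, 1].
    return torch.clamp(rgb + 0.5, 0.0, 1.0)
```

The degree-1 terms flip sign with the viewing direction, which is exactly what lets a Gaussian darken or brighten between viewpoints and produce the pseudo-shadows noted above.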
2. Diffusion-guided Optimization
2.1 SDS Loss + Image Optimization
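Before the results, a sketch of the Score Distillation Sampling gradient this section optimizes, following DreamFusion. The `unet` call signature (returning an unconditional/conditional noise pair for classifier-free guidance) is a stand-in, not the actual Stable Diffusion wrapper used in the assignment:

```python
import torch

def sds_grad(unet, latents, text_emb, alphas_cumprod, guidance_scale=100.0):
    """Score Distillation Sampling gradient, sketched per DreamFusion.

    unet(noisy_latents, t, text_emb) -> (eps_uncond, eps_text) is an
    assumed stand-in signature. SDS skips the UNet Jacobian, so the
    gradient is simply w(t) * (eps_hat - eps).
    """
    t = torch.randint(20, 980, (1,))          # random mid-range timestep
    eps = torch.randn_like(latents)           # noise to add (and target)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * eps
    with torch.no_grad():
        eps_uncond, eps_text = unet(noisy, t, text_emb)
        # Classifier-free guidance.
        eps_hat = eps_uncond + guidance_scale * (eps_text - eps_uncond)
    w = 1 - a_t
    return w * (eps_hat - eps)  # applied via latents.backward(gradient=...)
```

The key design choice is the stop-gradient through the UNet: treating `eps_hat - eps` as a fixed direction makes each step cheap and stable, at the cost of the mode-seeking behavior visible in the failure cases below.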
Without guidance (2000 iterations) a hamburger                    With guidance (2000 iterations) a hamburger
Without guidance (2000 iterations) a standing corgi dog                    With guidance (2000 iterations) a standing corgi dog
Without guidance (2000 iterations) diffusion                    With guidance (2000 iterations) diffusion
Without guidance (2000 iterations)                    With guidance (2000 iterations) (this is bullshit)
A snake poking its head out of a water bottle like Aquarius, while its body swirls at the bottom of the bottle
2.2 Texture Map Optimization for Mesh
A hamburger
A village like the Shire in The Lord of the Rings
I think the random viewpoints and light sources worsen the result: we cannot render, or "imprint", the image onto the mesh from any particular viewpoint, only from an "averaged" one, so eventually we imprint just the dominant color. The geometry is fixed, and we do not encode viewpoint information in the text prompt when generating latents.
2.3 NeRF Optimization
$\lambda_{entropy} = 10^{-2}$, $\lambda_{orient} = 10^{-3}$, latent iteration: $20\%$
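For context, here is a sketch of the two regularizers those weights scale. The shapes follow stable-dreamfusion-style conventions and are assumptions, not the exact starter code:

```python
import torch

def entropy_loss(alphas, eps=1e-5):
    """Binary entropy on per-ray accumulated opacity.

    Pushes each ray's opacity toward 0 (empty) or 1 (solid), discouraging
    semi-transparent fog. alphas: (R,) accumulated opacity per ray.
    """
    a = alphas.clamp(eps, 1 - eps)
    return (-a * a.log() - (1 - a) * (1 - a).log()).mean()

def orientation_loss(normals, view_dirs, weights):
    """Penalize visible surface normals that face away from the camera.

    normals:   (P, 3) unit normals at sample points
    view_dirs: (P, 3) unit ray directions (camera into scene)
    weights:   (P,)   rendering weights, so hidden points are ignored
    """
    n_dot_v = (normals * view_dirs).sum(-1)
    return (weights * n_dot_v.clamp(min=0.0) ** 2).mean()

# Assumed combination, using the weights stated above:
# loss = sds_loss + 1e-2 * entropy_loss(...) + 1e-3 * orientation_loss(...)
```

The entropy term is what carves empty space around the object; the orientation term suppresses the back-facing "inside-out" surfaces that SDS otherwise tends to hallucinate.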
A standing corgi dog
Final loss after 100 epochs, 2291.58 seconds: 0.445
A hamburger
Final loss after 100 epochs, 2438.83 seconds: 0.315
A Slytherin snake
Final loss after 100 epochs, 2288.24 seconds: 0.648
(pretty sure this is not a snake)
2.4.1 View-dependent text embedding
A standing corgi dog
Final loss after 100 epochs, 2293.36 seconds: 0.563
A hamburger
Final loss after 100 epochs, 2300.36 seconds: 0.769
A Slytherin snake
Final loss after 100 epochs, 2300.79 seconds: 0.439
The view-dependent text encoding fixed the missing-head problem with the snake, because conditioning on multiple views can reveal otherwise-occluded geometry.
However, I still think this is a hacky solution: simply appending a view phrase does not guarantee that the diffusion model understands or attends to different views in either a CV or an NLP sense. This is evident in the corgi experiments, where we got better results even without view-dependent text embedding.
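To make the "appending the view text" point concrete, here is a minimal sketch of the view-dependent prompt trick. The azimuth/elevation thresholds are assumptions; implementations vary:

```python
def view_dependent_prompt(prompt, azimuth_deg, elevation_deg):
    """Append a view phrase to the prompt before text encoding.

    Thresholds are illustrative assumptions, not the exact values
    used by any particular codebase.
    """
    if elevation_deg > 60.0:
        view = "overhead view"
    else:
        a = azimuth_deg % 360.0
        if a < 45.0 or a >= 315.0:
            view = "front view"
        elif 135.0 <= a < 225.0:
            view = "back view"
        else:
            view = "side view"
    return f"{prompt}, {view}"
```

Since the suffix only nudges the text embedding, nothing forces the rendered view to actually match the phrase, which is consistent with the mixed results above.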